Abstract

Transductive methods are useful in prediction problems when the training dataset is composed
of a large number of unlabeled observations and a smaller number of labeled observations. In
this paper, we propose an approach for developing transductive prediction procedures that are
able to take advantage of sparsity in high dimensional linear regression. More precisely, we
define transductive versions of the LASSO [Tib96] and the Dantzig Selector [CT07]. These
procedures combine labeled and unlabeled observations of the training dataset to produce a
prediction for the unlabeled observations. We propose an experimental study of the
transductive estimators, which shows that they improve on the LASSO and the Dantzig Selector
in many situations, particularly in high dimensional problems when the predictors are
correlated. We then provide non-asymptotic theoretical guarantees for these estimation
methods. Interestingly, our theoretical results show that the Transductive LASSO and Dantzig
Selector satisfy sparsity inequalities under weaker assumptions than those required for the
"original" LASSO.



1 Introduction

In many modern applications, statisticians have to deal with very large datasets. These may
involve a large number p of covariates, possibly larger than the sample size n. Let an
observation be an instance-label pair. In this paper, we tackle such high dimensional settings
which moreover involve a large amount of unlabeled data (say m instances) in addition to the
n labeled observations.

In contrast to inductive or supervised methods, transductive procedures exploit the knowledge
of the unlabeled data to improve prediction. It is argued in the semi-supervised learning
literature (see for example [CSZ06] for a recent survey) that taking into account the
information on the design given by the new additional instances has a stabilizing effect on the
estimator. In transductive methods, we furthermore take into account the objective of the
statistician: estimation of the value of the regression function only on the set of unlabeled data;
see Vapnik [Vap98a] for pioneering work. To leverage unlabeled data, transductive and
semi-supervised methods exploit the geometry of the marginal distribution. These methods were
initially introduced in the classification framework, and their good performance has been
observed in several practical fields. From a theoretical point of view, we refer to the
transductive version of SVM [Vap98b, Joa99, CZA05, WSP07], to the study of classifiers under the
clustering assumption [BM98, NMTM99, CS99, ZBL+O.Sl, XCtLOS, AZ05], and to transductive versions
of the Gibbs estimators [Cat07], among many others. Applications include the detection of spam
email [BM98, AG03, BBC+05], genetics applications [XP05], and the well-known Netflix challenge.

*LPMA (Univ. Paris 7), CREST-LS.

In this paper we focus on the linear regression model in the case p > n. In this setting,
dimension reduction is a major issue and can be performed through the selection of a small
number of relevant covariates. Numerous inductive or supervised methods have been proposed
in the literature, ranging from classical information criteria such as AIC [Aka73] and BIC
[Sch78] to the more recent $\ell_1$-regularized methods, such as the LASSO [Tib96], the Dantzig
Selector [CT07] and the non-negative garrote [YL07]. We also refer to [Kol09a, Kol09b, MVdGB09,
vdG08, DT07, CH08] for related works. Such regularized regression methods have recently witnessed
several developments due to the attractive feature of computational feasibility, even for high
dimensional data (i.e., when the number of covariates p is large). Formally we assume

$$y_i = x_i \beta^* + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where the design $x_i = (x_{i,1}, \ldots, x_{i,p}) \in \mathbb{R}^p$ is deterministic,
$\beta^* = (\beta^*_1, \ldots, \beta^*_p)' \in \mathbb{R}^p$ is the unknown parameter and
$\varepsilon_1, \ldots, \varepsilon_n$ are i.i.d. centered Gaussian random variables with known
variance $\sigma^2$. Let $X$ denote the matrix with i-th row equal to $x_i$, and let $X_j$ denote
its j-th column, with $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, p\}$. In this way, we can write

$$X = (x_1', \ldots, x_n')' = (X_1, \ldots, X_p).$$

For the sake of simplicity, we will assume that the observations are normalized in such a way
that $X_j' X_j / n = 1$. We denote by $Y$ the vector $Y = (y_1, \ldots, y_n)'$.

Let $x_{n+1}, \ldots, x_m$ be observed unlabeled instances with $x_i \in \mathbb{R}^p$ for
$n + 1 \le i \le m$ (with $m > n$). Let moreover $Z = (x_1', \ldots, x_m')'$.

For all $\alpha \ge 1$ and any vector $v \in \mathbb{R}^d$, we denote by $\|\cdot\|_\alpha$ the
$\ell_\alpha$-norm: $\|v\|_\alpha = (|v_1|^\alpha + \ldots + |v_d|^\alpha)^{1/\alpha}$. In
particular $\|\cdot\|_2$ is the Euclidean norm. Moreover, for any d-dimensional vector $v$
with $d \in \mathbb{N}$, we use the notation $\|v\|_0 = \sum_{i=1}^{d} 1(v_i \neq 0)$.
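
As a small illustration of the norm notation above, here is a minimal Python sketch, assuming NumPy is available. The function names `lq_norm` and `l0_count` are hypothetical helpers introduced only for this example; they are not part of the paper.

```python
import numpy as np

def lq_norm(v, q):
    """l_q norm (|v_1|^q + ... + |v_d|^q)^(1/q), intended for q >= 1."""
    v = np.asarray(v, dtype=float)
    return float(np.sum(np.abs(v) ** q) ** (1.0 / q))

def l0_count(v):
    """Number of non-zero coordinates of v (the quantity written ||v||_0)."""
    return int(np.count_nonzero(np.asarray(v)))

# Illustrative vector (hypothetical data): two non-zero coordinates.
v = np.array([3.0, 0.0, -4.0])
euclidean = lq_norm(v, 2)   # Euclidean norm
manhattan = lq_norm(v, 1)   # l_1 norm, the one used in the LASSO penalty
sparsity = l0_count(v)      # number of non-zero coordinates
```

The count `l0_count` is the quantity that appears in the sparsity inequalities discussed next, where it is applied to $\beta^*$.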

From a theoretical point of view, Sparsity Inequalities (SI) have been proved for the
regularized estimators mentioned above under different assumptions, in the inductive setting
only, i.e., without the knowledge of the matrix $Z$. That is, upper bounds of order
$O(\sigma^2 \|\beta^*\|_0 \log(p)/n)$ for the errors $(1/n)\|X\hat\beta - X\beta^*\|_2^2$ and
$\|\hat\beta - \beta^*\|_2^2$ have been derived, where $\hat\beta$ is one of those estimators.
Such bounds involve the number of non-zero coordinates in $\beta^*$ (multiplied by $\log(p)$),
instead of the dimension p. They guarantee that, under some assumptions, $X\hat\beta$ and
$\hat\beta$ are good estimators of $X\beta^*$ and $\beta^*$ respectively. For the LASSO, these SI
are given for example in [BTW07, BRT09], whereas [CT07, BRT09] provided SI for the Dantzig
Selector. On the other hand, Bunea [Bun08] established conditions which ensure that the LASSO
estimator and $\beta^*$ have the same null coordinates. Analogous results for the Dantzig Selector
can be found in [Lou08]. An important issue when we establish these theoretical results is the
assumption that is needed on the Gram matrix $n^{-1} X'X$. We refer to [vdGB09] for a nice
overview of these assumptions.

In this paper we are interested in the estimation of $Z\beta^*$: namely, we care about predicting
the labels attached to the additional $x_i$'s. Hence we develop transductive versions of the
LASSO and the Dantzig Selector to tackle the problem of estimating the vector $Z\beta^*$ in the
high dimensional setting. According to [Vap98a], this estimator should differ from an estimator
tailored for the estimation of $\beta^*$ or $X\beta^*$, like the LASSO. Indeed, a naive plug-in
method would be to build an estimator $\hat\beta(X, Y)$ and then to compute $Z\hat\beta(X, Y)$ to
estimate $Z\beta^*$. We rather consider here an approach where the estimators
$\hat\beta(X, Y, Z)$ exploit the knowledge of $Z$, and finally compute $Z\hat\beta(X, Y, Z)$.
These transductive procedures have several interesting properties:



• they take advantage of the unlabeled points to satisfy SIs with weaker assumptions on
the Gram matrix than those required for the usual inductive methods (cf. the examples
of Section 4.3);

• they perform well in practice compared to the inductive methods in most situations.
This is illustrated by a comparison between the performance of the Transductive
LASSO and the LASSO.

Let us mention that the estimators studied in this paper are generalizations of the LASSO
and the Dantzig Selector. They are not restricted to the transductive setting, since they can
be adapted to other objectives of the statistician. Indeed, the estimators depend on a
$q \times p$ matrix $A$, with $q \in \mathbb{N}$, whose choice allows us to consider the problem
of the estimation of $A\beta^*$. In this way, the estimation of $Z\beta^*$ appears as a particular
case.

The rest of the paper is organized as follows. In the next section, we give the definition of
the considered estimators. We then display, in Section 3, a set of experiments that show how
the Transductive estimators can improve on the LASSO and the Dantzig Selector in many
applications. A non-asymptotic study is provided in Section 4, whereas all the proofs of the
theorems are postponed to Section 6.

2 Definition of the estimators 

Here we define the family of estimators we consider in the sequel and more specifically the 
Transductive LASSO and the Transductive Dantzig Selector. 

2.1 Definitions 

Let us first recall that the LASSO estimate can be defined by

$$\hat\beta^L_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2^2 + 2\lambda\|\beta\|_1 \right\}, \qquad (2)$$

where $\lambda$ is a positive tuning parameter. Let us make a few simple remarks to clarify the
notation used in the paper. Since $Y = X\beta^* + \varepsilon$, the response vector $Y$ can
be seen as an estimator of $X\beta^*$. We write $Y = \widehat{X\beta^*}$ to convey this fact.
Actually, if $\sigma \approx 0$, $Y$ could even be a good estimator. However, in the general case,
it is not expected to be a particularly interesting estimator, and $Y = \widehat{X\beta^*}$ is
only used as a preliminary estimator of the vector $X\beta^*$. Based on this preliminary estimate,
the LASSO defined by (2) ensures, in the case where $\beta^*$ is sparse, a good estimation of
$X\beta^*$ by $X\hat\beta^L_\lambda$ (cf. [BRT09] for instance).

Let us now generalize the previous comments. Let $A\beta^*$ be a quantity of interest for a
general (and given) $q \times p$ matrix $A$ with $q \in \mathbb{N}^*$. Then an analog of the LASSO
estimator (2) can be given by the following definition.

Definition 2.1 (The Transductive LASSO). Let $\widehat{A\beta^*}$ be a preliminary estimator (that
can be a very poor estimator) of $A\beta^*$ and define

$$\hat\beta_{A,\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|\widehat{A\beta^*} - A\beta\|_2^2 + 2\lambda\|\beta\|_1 \right\}.$$

In particular, when $A = \sqrt{n/m}\,Z$, the estimator $\hat\beta_{A,\lambda}$ is the Transductive LASSO.
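The Dantzig Selector is defined as

$$\tilde\beta^{DS}_\lambda = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad \|X'(Y - X\beta)\|_\infty \le \lambda, \qquad (3)$$

where $\lambda$ is a positive tuning parameter. In the same way as for the LASSO estimator, we
underline the role of $Y = \widehat{X\beta^*}$ and propose the following definition.

Definition 2.2 (The Transductive Dantzig Selector). Let $\widehat{A\beta^*}$ be a preliminary
estimator of $A\beta^*$ and define

$$\tilde\beta_{A,\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \|\beta\|_1 \quad \text{s.t.} \quad \|A'(\widehat{A\beta^*} - A\beta)\|_\infty \le \lambda.$$

In particular, when $A = \sqrt{n/m}\,Z$, the estimator $\tilde\beta_{A,\lambda}$ is the Transductive Dantzig Selector.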

In Definitions 2.1 and 2.2, the matrix $A$ is not specified. Hence, we cover here a general
objective. However, we mainly focus in this paper on the transductive setting, so our
principal study deals with the estimation of $Z\beta^*$. In other words, this means that we set in
the above definitions $A = \sqrt{n/m}\,Z$, the unlabeled data matrix (with a normalization term
$\sqrt{n/m}$ that is here for the sake of convenience, see Section 4). An important issue in this
paper is also the choice of the preliminary estimator $\widehat{A\beta^*}$. An explicit condition
on this estimator can be found in Section 4. It ensures the good theoretical performance of
$\hat\beta_{A,\lambda}$ and $\tilde\beta_{A,\lambda}$. Let us now propose some examples of
preliminary estimators.

Examples 2.1. i) The simplest idea is to estimate $\beta^*$ by the least squares estimator. Even
if p > n, which implies that $(X'X)$ is not invertible, we can choose any pseudo-inverse
$(X'X)^{-}$ of $(X'X)$ and use

$$\widehat{A\beta^*} = A (X'X)^{-} X'Y$$

as a preliminary estimator of $A\beta^*$. Note that if $\mathrm{Ker}(X) \subset \mathrm{Ker}(A)$,
this quantity is uniquely defined (it does not depend on the choice of the particular
pseudo-inverse $(X'X)^{-}$). We will see later that in this case, we may have theoretical
guarantees for the performance of $\hat\beta_{A,\lambda}$ and $\tilde\beta_{A,\lambda}$.

ii) One may think about more sophisticated regularization procedures, based for instance on

$$\widehat{A\beta^*} = A(\gamma A'A + X'X)^{-1} X'Y,$$

for (a small) $\gamma > 0$ when the matrix $A'A$ is invertible (in the spirit of ridge regression).

iii) Finally, practitioners may prefer to use as a preliminary estimator something known to
work well in practice, like

$$\widehat{A\beta^*} = A\hat\beta^L_{\lambda'},$$

with $\lambda' > 0$. We pay particular attention to this initial estimator in the rest of the paper.
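
The first two preliminary estimators of Examples 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: NumPy is assumed, the function names `prelim_least_squares` and `prelim_ridge` are hypothetical, and the Moore-Penrose pseudo-inverse is used as one possible choice of pseudo-inverse (the text allows any).

```python
import numpy as np

def prelim_least_squares(A, X, Y):
    """Example i): A (X'X)^- X'Y, with the Moore-Penrose pseudo-inverse
    taken as one admissible choice of pseudo-inverse."""
    return A @ np.linalg.pinv(X.T @ X) @ X.T @ Y

def prelim_ridge(A, X, Y, gamma):
    """Example ii): A (gamma A'A + X'X)^{-1} X'Y for a small gamma > 0.
    Assumes the p x p matrix gamma A'A + X'X is invertible."""
    G = gamma * (A.T @ A) + X.T @ X
    return A @ np.linalg.solve(G, X.T @ Y)
```

In a noiseless toy case where $X$ has full column rank, both functions should return (approximately) $A\beta^*$; with p > n the two generally differ, and only the ridge-type version uses the information contained in $A$ when forming the inverse.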



2.2 A discussion on the matrix A 

These novel estimators depend on two tuning parameters. First, $\lambda > 0$ is a regularization
parameter; it plays the same role as the tuning parameter involved in the LASSO and will be
discussed in our simulations and our theoretical results. The second one is the matrix $A$, which
allows us to adapt the estimator to the objective of the statistician. In this paper, we are
mainly interested in the following objective:

• transductive objective: the estimation of $Z\beta^*$, by $Z\hat\beta_{A,\lambda}$ or
$Z\tilde\beta_{A,\lambda}$ with $A = \sqrt{n/m}\,Z$. Note that in the case n < p < m, it is
possible that the matrix $Z'Z$ is invertible, while $X'X$ is not.

Other choices of the matrix $A$ are possible and help to deal with other objectives. We display
here two additional feasible and well-known choices:

• denoising objective: the estimation of $X\beta^*$, that is, a denoised version of $Y$. For this
purpose, we consider the estimator $\hat\beta_{A,\lambda}$ with $A = X$. In this case, if we keep
$Y$ as our preliminary estimator of $X\beta^*$, our estimators are exactly the LASSO and the
Dantzig Selector, so this case is not of particular interest in this paper;

• estimation objective: the estimation of $\beta^*$ itself, by $\hat\beta_{A,\lambda}$ with
$A = \sqrt{n}\,I_p$, where $I_p$ is the identity matrix of size p.

Thanks to the unifying notation $A$, the theoretical performance of the estimators based on
the above choices is studied simultaneously in Section 4. However, we mention that the
main contribution relates to the transductive objective.
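
The three choices of $A$ listed above can be gathered in one small helper. The following Python sketch assumes NumPy; the function name `objective_matrix` and the string labels are hypothetical and serve only to illustrate how the objective determines $A$.

```python
import numpy as np

def objective_matrix(kind, X, Z):
    """Return the matrix A associated with a given objective.

    X is the n x p labeled design and Z the m x p matrix of all instances.
    """
    n, p = X.shape
    m = Z.shape[0]
    if kind == "transductive":   # estimation of Z beta*
        return np.sqrt(n / m) * Z
    if kind == "denoising":      # estimation of X beta*
        return X
    if kind == "estimation":     # estimation of beta* itself
        return np.sqrt(n) * np.eye(p)
    raise ValueError(f"unknown objective: {kind}")
```

With this convention, $A'A/n$ is $Z'Z/m$ in the transductive case, $X'X/n$ in the denoising case and the identity in the estimation case, which is one way to read the normalization factors.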

3 Experimental results 

In this section we compare the empirical performance of the Transductive LASSO and the
LASSO estimators on simulated and real datasets with respect to the transductive task. We
consider both low and high dimensional simulated data. The real dataset comes from a genetic
study, devoted to learning the complex combinatorial code underlying gene expression. More
details are given in Section 3.3. In this dataset, there are p = 666 predictor variables and
the total number of available labeled data is 2587. The conclusion of our experiments is that
the Transductive LASSO outperforms the LASSO estimator in most settings, and specifically
when the variables are correlated.

3.1 Implementation 

Since the paper of Tibshirani [Tib96], several effective algorithms to compute the LASSO have
been proposed and studied (for instance Interior Point methods [KKL+07], LARS [EHJT04],
Pathwise Coordinate Optimization [FHHT07], Relaxed Greedy Algorithms [HCB08]). For the
Dantzig Selector, a linear programming method was proposed in the first paper [CT07]. The LARS
algorithm was also successfully extended in [JRL09] to compute the Dantzig Selector.
Note that these methods allow us to compute our estimators $\hat\beta_{A,\lambda}$ and
$\tilde\beta_{A,\lambda}$, as they simply appear as the LASSO and the Dantzig Selector computed on
modified data. Namely, after the computation of the preliminary estimator $\widehat{A\beta^*}$ of
$A\beta^*$, the transductive estimator $\hat\beta_{A,\lambda}$ is obtained as a usual LASSO
solution where the usual data $(X, Y)$ are replaced by $(A, \widehat{A\beta^*})$.
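
The remark that the transductive estimator is an ordinary LASSO on modified data can be sketched as follows. This is a minimal Python illustration, assuming NumPy, and not the implementation used in the experiments (which rely on glmnet). The solver is a plain coordinate descent with a fixed number of sweeps and no convergence check; the names `soft_threshold`, `lasso_cd` and `transductive_lasso` are hypothetical. The sketch follows Definition 2.1 with the LASSO as preliminary estimator and does not include any further adjustment of the preliminary estimator.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for argmin_b ||y - X b||_2^2 + 2 * lam * ||b||_1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    b = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)   # squared norms of the columns
    r = y.copy()                      # residual y - X b (b = 0 initially)
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue              # an all-zero column is left at 0
            r += X[:, j] * b[j]       # remove coordinate j from the fit
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * b[j]       # put the updated coordinate back
    return b

def transductive_lasso(X, Y, Z, lam1, lam2):
    """Preliminary LASSO on (X, Y), then LASSO on the modified data (A, A b)."""
    n, m = X.shape[0], Z.shape[0]
    A = np.sqrt(n / m) * Z                 # normalization of Definition 2.1
    beta_prelim = lasso_cd(X, Y, lam1)     # preliminary estimator, Example 2.1-iii)
    return lasso_cd(A, A @ beta_prelim, lam2)
```

Any LASSO solver that minimizes the same criterion could replace `lasso_cd` here; only the pair of inputs changes between the inductive and the transductive fit.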



We use the version of the Transductive LASSO proposed in Section 2.1 based on the LASSO
as preliminary estimator. That is, we set $A = Z$ in Definition 2.1 and refer to Example 2.1-iii)
for the definition of the preliminary estimator $\widehat{A\beta^*}$. In other words, for a given
$\lambda_1$, we first compute the LASSO estimator $\hat\beta_{X,\lambda_1} = \hat\beta^L_{\lambda_1}$.
In this way we have $\widehat{Z\beta^*} = Z\hat\beta^L_{\lambda_1}$. However, recalling that
$Z = (x_1', \ldots, x_n', \ldots, x_m')'$, it is worth noting that we should keep the first n
components of $\widehat{Z\beta^*}$ equal to $Y$. Indeed, the first n components correspond to the
labeled samples and do not need to be replaced by an estimate. Given this adjustment, the
Transductive LASSO is given by

$$\hat\beta^{TL}(\lambda_1, \lambda_2) = \arg\min_{\beta \in \mathbb{R}^p} \frac{n}{m}\|Z\beta\|_2^2 \quad \text{s.t.} \quad \left\| \frac{n}{m} Z'(\widehat{Z\beta^*} - Z\beta) \right\|_\infty \le \lambda_2,$$

for a given $\lambda_2$ (cf. Section 4 for a theoretical study of this estimator). Let us mention
that the good performance of this estimator is stated in Theorem 4.6 under some assumptions. We
compare this two-step procedure with the procedure obtained using only the usual LASSO
$\hat\beta^L_\lambda = \hat\beta_{X,\lambda}$ for a given $\lambda$ that may differ from
$\lambda_1$. In both cases, the solutions are computed using the glmnet package¹, introduced by
Friedman et al., which provides the LASSO solution, among others. We compute
$\hat\beta^L_\lambda$ and $\hat\beta^{TL}(\lambda_1, \lambda_2)$ for
$(\lambda, \lambda_1, \lambda_2) \in \Lambda^3$, where $\Lambda$ is some grid defined in a
data-driven way by the glmnet algorithm. In all the experiments, we choose the best tuning
parameters. That is, the choice is based on the truth: we only compare the oracles in our
families of estimators. This way of selecting the tuning parameters is applicable even in our
real data experiments. Indeed, the initial data consist only of labeled data. We then hide many
response values and construct the estimators without their knowledge. Finally, the best
estimators (tuning parameters) are chosen based on those hidden responses.

3.2 Synthetic data 

The comparison between the LASSO and the Transductive LASSO is made through the study
of the distribution of

$$PERF(Z) = \frac{\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|Z(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2}{\min_{\lambda \in \Lambda} \|Z(\hat\beta^L_\lambda - \beta^*)\|_2^2}$$

over 100 replications for each experiment. Since the LASSO is a special case of the
Transductive LASSO (with $\lambda_2 = 0$), PERF(Z) belongs to [0, 1]. This ratio measures the
improvement made by the Transductive LASSO with respect to the transductive objective. The
smaller PERF(Z) is, the more attractive the use of the Transductive LASSO is. An analogous study
can be carried out to compare the performance of the Transductive LASSO and the LASSO in terms
of the denoising and the estimation tasks, based respectively on the ratios

$$PERF(X) = \frac{\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|X(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2}{\min_{\lambda \in \Lambda} \|X(\hat\beta^L_\lambda - \beta^*)\|_2^2}$$



¹ The algorithm is implemented in R and is available at
http://cran.r-project.org/web/packages/glmnet/index.html



Table 1: Evaluation of the mean (Mean), the median (Med) and the quantile Q3 of order 0.3
of the quantities PERF(Z), PERF(X) and PERF(I), when the methods are used in the
artificial low dimensional case p < n.

                                    PERF(Z)              PERF(X)              PERF(I)
p     s    (n, m)      ρ     σ²     Mean  Med   Q3       Mean  Med   Q3       Mean  Med   Q3
8     1    (10, 30)    0.1   1      0.64  0.77  0.47     0.62  0.70  0.44     0.64  0.75  0.42
8     1    (10, 30)    0.9   25     0.57  0.54  0.11     0.63  0.47  0.11     0.66  0.61  0.10
50    1    (60, 80)    0.1   1      0.73  0.81  0.59     0.74  0.82  0.62     0.72  0.77  0.60
50    1    (60, 200)   0.9   1      0.70  0.86  0.63     0.69  0.83  0.60     0.68  0.80  0.56
===========================================================================================
8     3    (10, 30)    0.1   1      0.85  0.91  0.84     0.83  0.89  0.81     0.83  0.93  0.78
8     3    (10, 30)    0.9   100    0.74  0.84  0.70     0.71  0.80  0.59     0.75  0.79  0.61
8     3    (10, 100)   0.5   1      0.90  0.98  0.91     0.89  0.99  0.88     0.87  0.95  0.85
8     3    (10, 100)   0.9   25     0.75  0.88  0.64     0.72  0.80  0.60     0.74  0.82  0.68
50    20   (100, 120)  0.1   1      0.98  1     0.98     0.98  1     0.98     0.98  1     0.98
50    20   (100, 120)  0.9   1      0.76  0.75  0.68     0.58  0.56  0.51     0.96  1     1



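and

$$PERF(I) = \frac{\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*\|_2^2}{\min_{\lambda \in \Lambda} \|\hat\beta^L_\lambda - \beta^*\|_2^2}.$$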
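
The PERF ratios share the same structure: a minimum of squared errors over the Transductive LASSO candidates divided by the same minimum over the LASSO candidates. A minimal Python sketch, assuming NumPy and assuming the candidate estimates over the grids have already been computed, is given below; the function name `perf_ratio` and its arguments are hypothetical.

```python
import numpy as np

def perf_ratio(M, beta_star, betas_tl, betas_lasso):
    """Oracle error ratio for a given matrix M.

    M = Z gives PERF(Z), M = X gives PERF(X) and M = identity gives PERF(I).
    betas_tl and betas_lasso are iterables of candidate coefficient vectors,
    one per point of the tuning grid. The ratio is undefined if the best
    LASSO candidate has zero error.
    """
    def sq_err(b):
        return float(np.sum((M @ (np.asarray(b, dtype=float) - beta_star)) ** 2))

    best_tl = min(sq_err(b) for b in betas_tl)
    best_lasso = min(sq_err(b) for b in betas_lasso)
    return best_tl / best_lasso
```

Because both minima are taken with knowledge of $\beta^*$, this quantity compares oracle choices of the tuning parameters, as in the experiments described here.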



Data description. We consider several simulations from the linear regression model

$$y_i = x_i \beta^* + \varepsilon_i,$$

for $i \in \{1, \ldots, n\}$, where $\beta^* \in \mathbb{R}^p$ and the $\varepsilon_i$ are i.i.d.
$\mathcal{N}(0, \sigma^2)$. The design matrix comes from a centered multivariate normal
distribution with covariance structure $\mathrm{Cov}(X_j; X_k) = \rho^{|j-k|}$, with
$\rho \in ]0, 1[$. The dimension p, sample sizes (n, m), noise level $\sigma^2$ and correlation
parameter $\rho$ are left free. They will be specified during the experiments in order to check
the robustness of the results. The regression vector $\beta^*$ is s-sparse, where $s \ge 1$ is an
integer and corresponds to the sparsity index. That is, $\beta^*$ has s non-zero components. Since
the LASSO and the Transductive LASSO do not depend on the ordering of the variables, we define
$\beta^*$ such that its first s components are non-zero and equal to 5.
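
The simulation design described above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: NumPy is assumed, the function name `simulate` and the seed handling are hypothetical, and the m instances are drawn jointly so that the first n rows of Z form the labeled design X.

```python
import numpy as np

def simulate(n, m, p, s, rho, sigma2, seed=0):
    """Draw (X, Y, Z, beta_star) from the model described in the text.

    Rows are centered Gaussian with Cov(X_j, X_k) = rho^|j-k|; beta_star has
    its first s components equal to 5; the noise is N(0, sigma2).
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])   # p x p covariance matrix
    Z = rng.multivariate_normal(np.zeros(p), cov, size=m)
    X = Z[:n]                                          # labeled instances
    beta_star = np.zeros(p)
    beta_star[:s] = 5.0
    Y = X @ beta_star + np.sqrt(sigma2) * rng.standard_normal(n)
    return X, Y, Z, beta_star
```

For example, `simulate(10, 30, 8, 1, 0.9, 25.0)` corresponds to the parameter values of the second line of Table 1, up to the normalization of the columns, which this sketch does not perform.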

Our study covers several combinations of the parameters p, s, (n, m), $\rho$ and $\sigma^2$. In
the next paragraph, we examine the performance of each estimator according to the value of the
regularization parameters.

Results. We consider separately the low and the high dimensional cases.

The low dimensional case. Several examples have been studied and we illustrate the
performance of the methods in this case through some specific experiments. The main parameter
seems to be the sparsity index s. More precisely, the behavior of the Transductive LASSO
compared to the LASSO is related to how large the sparsity index is in comparison to the
dimension p. This is illustrated in Table 1, where the two cases are separated by two horizontal
lines. Hence, when s is small, we notice a good improvement when using the Transductive
LASSO instead of the LASSO estimator. This is displayed in lines 1 to 3 of Table 1, where all
of the ratios PERF(Z), PERF(X) and PERF(I) are most of the time between 0.5 and 0.8.



Table 2: Evaluation of the mean (Mean), the median (Med) and the quantile Q3 of order 0.3
of the quantities PERF(Z), PERF(X) and PERF(I), when the methods are used in the
artificial high dimensional case p > n.

                                    PERF(Z)              PERF(X)              PERF(I)
p     s    (n, m)      ρ     σ²     Mean  Med   Q3       Mean  Med   Q3       Mean  Med   Q3
10    8    (5, 15)     0.1   1      0.73  0.77  0.64     0.33  0.23  0.14     0.73  0.74  0.65
10    8    (5, 15)     0.9   1      0.78  0.84  0.71     0.45  0.42  0.29     0.78  0.80  0.73
10    8    (5, 15)     0.9   100    0.79  0.82  0.71     0.70  0.77  0.55     0.79  0.81  0.72
1000  50   (20, 60)    0.1   1      0.94  1     1        0.95  1     1        0.97  1     1
1000  50   (20, 60)    0.9   1      0.60  0.53  0.36     0.44  0.48  0.33     0.78  0.82  0.65
1000  50   (100, 200)  0.1   1      0.98  1     1        0.87  0.89  0.79     0.99  1     1
1000  50   (100, 200)  0.9   1      0.49  0.45  0.38     0.35  0.32  0.25     0.90  0.97  0.86
1000  50   (100, 200)  0.9   25     0.50  0.46  0.39     0.43  0.40  0.30     0.86  0.91  0.80



We also remark that the performance of the Transductive LASSO is even better when the
parameters $\rho$ and $\sigma^2$ increase (line 2). On the other hand, when the sparsity index s
is large (with respect to the dimension p), it turns out that the Transductive LASSO does not
improve much on the LASSO estimator (lines 7 and 9) when p is large, whereas it is still
satisfactory for small values of p. By poor improvement, we mean that
$\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|Z(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2$
is not far from
$\min_{(\lambda_1,0) \in \Lambda^2} \|Z(\hat\beta^{TL}(\lambda_1,0) - \beta^*)\|_2^2$,
which coincides with the LASSO error
$\min_{\lambda \in \Lambda} \|Z(\hat\beta^L_\lambda - \beta^*)\|_2^2$. An important observation is
that even when s is large, in the case where the variables are highly correlated, that is when
$\rho$ is large, the Transductive LASSO can be a good alternative to the LASSO estimator (lines
6, 8 and 10). This observation holds for both the transductive error ratio PERF(Z) and the
denoising error ratio PERF(X). On the other hand, even with high correlations between variables,
the Transductive LASSO does not improve the estimation error ratio PERF(I) in this last
situation.
In the low dimensional case, it seems that increasing the number m of unlabeled data does not
lead to an improvement of the performance of the Transductive LASSO. This is for instance
displayed in line 4 of Table 1, where m = 200, or in lines 7 and 8 of the same table, where
m = 100. Finally, note that when n is large, the LASSO estimator behaves well and it becomes
difficult to improve on it with the Transductive LASSO.



The high dimensional case. Table 2, Figure 1 and Figure 2 summarize the results in this
case. The main observation is that the behavior of the quantities PERF(Z), PERF(X) and
PERF(I) highly depends on whether the sparsity index s is larger than the sample size n or
not. Let us distinguish these two cases.

• When s > n: in this difficult case, the performance of the Transductive LASSO varies
with the dimension p. Indeed, for moderate p (a dimension smaller than about 100), the
Transductive LASSO performs well compared to the LASSO, as observed in the first
three lines of Table 2, where n = 5 < 8 = s. Note that in this case, the improvements using
the Transductive LASSO are particularly visible for the denoising error ratio PERF(X), with a
median value equal to 0.23 and 0.42 respectively when $\rho = 0.1$ and $\rho = 0.9$ (and with
$\sigma^2 = 1$). Nevertheless, we notice that the transductive error ratio PERF(Z) deteriorates
when the noise level increases. In this case the denoising error ratio PERF(X) is even more
affected (see line 3 in Table 2). Despite this deterioration, the Transductive LASSO still
behaves well in this setting. The same conclusions can be drawn from the experiments related to
Figure 1, where p = 50, s = 20 and n = 10. Note that in this example, when $\rho = 0.1$ and
$\sigma^2 = 1$, the median values of PERF(Z) and PERF(X) are respectively 0.82 and 0.14.
On the other hand, for large p (and when s > n), it turns out that the Transductive LASSO
improves on the LASSO estimator only when the variables are correlated. This is illustrated
in Table 2 (lines 4 and 5) when p = 1000, s = 50 and n = 20.

Figure 1: Study of the distribution of PERF when p = 50, s = 20 and (n, m) = (10, 20). Top:
study of PERF(Z); Bottom: study of PERF(X). From left to right: $\rho = 0.1$ and $\sigma^2 = 1$;
$\rho = 0.1$ and $\sigma^2 = 100$; $\rho = 0.9$ and $\sigma^2 = 1$.

• When s ≤ n: we can consider two sub-cases, distinguishing problems with a large sparsity
index s (in comparison to p) from those with a small one. The last three lines of Table 2
summarize the performance of the methods when s is large. First, note that the Transductive
LASSO improves only slightly on the LASSO in terms of the transductive error ratio PERF(Z) when
$\rho = 0.1$ and $\sigma^2 = 1$. Nevertheless, it remains satisfactory when we deal with the
prediction error ratio PERF(X). On top of that, the Transductive LASSO seems to be particularly
interesting when the predictors are highly correlated (line 7 in Table 2 with $\rho = 0.9$), even
in the presence of noise (last line in Table 2 where $\rho = 0.9$ and $\sigma^2 = 25$). In this
case, increasing the correlation $\rho$ between variables and the noise level $\sigma^2$ seems to
imply better performance of the Transductive LASSO compared to the LASSO. On the other hand,
Figure 2 illustrates the case where the sparsity index s is small compared to p. Here, p = 1000
and s = 1. It turns out that the Transductive LASSO is either very useful or useless. Indeed,
as observed in the displayed histograms, the distributions of the quantities PERF(Z) (top)
and PERF(X) (bottom) are mainly concentrated around 0 (meaning a very big improvement
using the Transductive LASSO) and around 1 (meaning almost no improvement using the
Transductive LASSO). The Transductive LASSO significantly improves on the LASSO in general.
Nevertheless, the behavior of the Transductive LASSO is here sensitive to the increase of
$\sigma^2$. One can compare for this purpose the third column in Figure 2 and the last line of
Table 2.

In the high dimensional setting, increasing the size m of the unlabeled dataset is not
advantageous to the performance of the Transductive LASSO in terms of the transductive error.
This can be observed in the last column of Figure 2.

Figure 2: Study of the distribution of PERF when p = 1000 and s = 1. Top: study of PERF(Z);
Bottom: study of PERF(X). From left to right: (n, m) = (5, 20), $\rho = 0.1$ and $\sigma^2 = 1$;
(n, m) = (5, 20), $\rho = 0.9$ and $\sigma^2 = 1$; (n, m) = (5, 20), $\rho = 0.9$ and
$\sigma^2 = 25$; (n, m) = (5, 500), $\rho = 0.9$ and



Conclusion of the simulation study: the Transductive LASSO seems to be a good alternative
to the LASSO in most cases. It responds well not only to the transductive objective (through
PERF(Z)), but also to the denoising and the estimation objectives (through PERF(X) and
PERF(I) respectively). The Transductive LASSO is particularly useful in the difficult
situation where the variables are highly correlated. It is also often robust to variations of
the noise level. Moreover, it appears that in general, a large number m of unlabeled data does
not help to make the Transductive LASSO better than the LASSO. The method works better with
small values of m. Hence more clever ways to exploit the unlabeled points could be devised. For
instance, one may add weights to the observations. More precisely, one can associate to each
labeled point a weight larger than the weight set for the unlabeled points. This will be the
topic of future work. Furthermore, the simulation study reveals how beneficial the use of the
unlabeled points can be, even to improve the performance in the denoising task.

Finally, a surprising observation in most of our experiments is that, as often as not, the
minimum in

$$\min_{(\lambda_1,\lambda_2) \in \Lambda^2} \|Z(\hat\beta^{TL}(\lambda_1,\lambda_2) - \beta^*)\|_2^2 \le \min_{(\lambda_1,0) \in \Lambda^2} \|Z(\hat\beta^{TL}(\lambda_1,0) - \beta^*)\|_2^2$$

does not significantly depend on $\lambda_1$ for a very large range of values of $\lambda_1$.
This is quite interesting for a practitioner, as it means that when using the Transductive LASSO,
we can significantly reduce the computational cost and deal (almost) with only a single unknown
tuning parameter (namely $\lambda_2$) rather than with two.



Discussion on the regularization parameter. We would like to point out the importance 
of the tuning parameter A in a general term. Figure [3] illustrates a graph of a typical experi- 
ment in the low dimensional setting. There are two curves on this graph, which represent the quantities $(1/n)\|X(\hat\beta_{\lambda} - \beta^*)\|_2^2$ and $(1/m)\|Z(\hat\beta_{\lambda} - \beta^*)\|_2^2$ with respect to $\lambda$ (here $\hat\beta_{\lambda}$ is the LASSO estimate computed with tuning parameter $\lambda$). We observe that the two functions do not reach their minimum for the same value of $\lambda$ (the minima are highlighted on the graph by a circle and a cross), even if these minima are quite close. Since we consider variable selection methods, the identification of the true support $\{j : \beta^*_j \neq 0\}$ of the vector $\beta^*$ is also of concern. One expects that the estimator $\hat\beta$ and the true vector $\beta^*$ share the same support, at least when $n$ is large enough. This is known as the variable selection consistency problem and it has been considered for the LASSO estimator in several works [Bun08, MB06, MY09, Wai06, ZY06]. Recently, [Lou08] established the variable selection consistency of the Dantzig Selector. Other popular selection procedures based on the LASSO estimator, such as the Adaptive LASSO [Zou06], the SCAD [FL01], the S-LASSO [Heb08] and the Group-LASSO [Bac08], have also been studied from a variable selection point of view. Following our previous work [AH08], it is possible to provide such results for the Transductive LASSO. The variable selection task is also illustrated in Figure 3 by the vertical line. We report the minimal value of $\lambda$ for which the LASSO estimator correctly identifies the nonzero components of $\beta^*$. This value of $\lambda$ is quite different from the values that minimize the prediction losses. This observation is recurrent in almost all the experiments: the estimation of $X\beta^*$, the estimation of $Z\beta^*$ and the recovery of the support of $\beta^*$ are three different objectives and have to be treated separately. We cannot expect in general to find a choice of $\lambda$ for which the LASSO, for instance, performs well for all these objectives simultaneously.

Figure 3: Evolution of the denoising error (red solid line) and the transduction error (blue dashed line) of the LASSO with respect to $\lambda$. The minima of the denoising and the transduction errors are marked respectively by a red circle and a blue cross. The best tuning parameter for variable selection is indicated by a vertical line.
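The comparison of the two error curves can be reproduced on synthetic data with a few lines of code. The sketch below is only an illustration under stated assumptions: the coordinate descent solver `lasso_cd`, the simulated design, the noise level and the grid of $\lambda$ values are all hypothetical choices made here, not the setting used in the experiments above. It minimizes $\|Y - X\beta\|_2^2 + 2\lambda\|\beta\|_1$ and records the denoising and transduction errors for each $\lambda$.

```python
import numpy as np


def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_cd(X, Y, lam, n_iter=200):
    """Cyclic coordinate descent for ||Y - X b||_2^2 + 2 * lam * ||b||_1."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    residual = Y.copy()  # equals Y - X @ beta, with beta = 0 at the start
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * beta[j]  # remove the j-th contribution
            beta[j] = soft_threshold(X[:, j] @ residual, lam) / col_sq[j]
            residual -= X[:, j] * beta[j]  # add the updated contribution back
    return beta


# Hypothetical synthetic setting: Z stacks the n labeled rows (X) and m - n extra rows.
rng = np.random.default_rng(0)
n, m, p = 50, 200, 20
beta_star = np.zeros(p)
beta_star[:3] = [3.0, -2.0, 1.5]
Z = rng.standard_normal((m, p))
X = Z[:n]
Y = X @ beta_star + 0.5 * rng.standard_normal(n)

lambdas = np.geomspace(0.1, 100.0, 30)
denoising, transduction = [], []
for lam in lambdas:
    b = lasso_cd(X, Y, lam)
    denoising.append(np.sum((X @ (b - beta_star)) ** 2) / n)
    transduction.append(np.sum((Z @ (b - beta_star)) ** 2) / m)

# The two minimizers need not coincide, which is the point made in the text.
lam_denoising = lambdas[int(np.argmin(denoising))]
lam_transduction = lambdas[int(np.argmin(transduction))]
```

Comparing `lam_denoising` and `lam_transduction` on a given draw shows whether the two objectives select the same tuning parameter; nothing guarantees that they do.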

3.3 Real data 

We apply the Transductive LASSO and the LASSO estimators to a genetic dataset, where the goal is to learn the complex combinatorial code underlying gene expression. These data have already been analyzed in [MB07] and the original source is [BT04]. The problem we consider here is known as motif regression [CLLL03]. By motif, we mean a sequence of letters consisting of A, C, G and T. The instances in this dataset are genes coming from yeast. More precisely, $L = 2587$ genes are available. We also have $p = 666$ variables. Each of them (a vector of length 2587) consists of scores computed for a given candidate motif. These scores measure how well the motifs are represented in the upstream regions of the genes. To summarize, each row of this $L \times p$ design matrix corresponds to a gene and each column to a motif score. In other words, each component $(i,j) \in \{1,\ldots,L\} \times \{1,\ldots,p\}$ of this matrix measures how well the $j$-th motif is represented in the upstream region of the $i$-th gene. The response vector is a vector of size $L$. Its $i$-th component is the expression value of the $i$-th gene. Actually, 255 response vectors are available. These measurements have been collected in a time-course experiment, so that each response vector corresponds to a measurement of the gene expressions at one time-point. In our study, we use only one response vector per experiment, so we first pick one of the 255 time-points. As for the construction of the labeled and the unlabeled datasets, we choose to pick each of them randomly among the 2587 available instances.

Table 3: Median and quantile of order 0.3 (Med [Q3]) of the quantities PERF(Z) in the high dimensional real dataset. Here $n$ is the labeled sample size (rows) and $t = m - n$ is the number of additional unlabeled observations (columns).

    n \ t    10            20            50            100           500           1000
    10       0.85 [0.67]   0.88 [0.77]   0.90 [0.84]   0.97 [0.94]   0.98 [0.97]   0.99 [0.98]
    20       0.74 [0.52]   0.85 [0.70]   0.86 [0.76]   0.91 [0.86]   0.96 [0.95]   0.98 [0.97]
    50       0.78 [0.49]   0.68 [0.53]   0.81 [0.67]   0.84 [0.72]   0.94 [0.91]   0.97 [0.96]
    100      0.72 [0.47]   0.68 [0.47]   0.75 [0.58]   0.75 [0.63]   0.87 [0.84]   0.95 [0.93]
    500      0.43 [0.29]   0.51 [0.34]   0.49 [0.30]   0.49 [0.39]   0.81 [0.73]   0.88 [0.83]
    1000     0.96 [0.93]   0.97 [0.92]   0.91 [0.85]   0.89 [0.81]   0.88 [0.83]   0.86 [0.80]

Table 4: Median and quantile of order 0.3 (Med [Q3]) of the quantities PERF(X) in the high dimensional real dataset. Here $n$ is the labeled sample size (rows) and $t = m - n$ is the number of additional unlabeled observations (columns).

    n \ t    10              20              50              100             500               1000
    10       0.04 [0.01]     0.010 [0.003]   0.04 [0.02]     0.005 [0.001]   0.0011 [0.0003]   0.12 [0.10]
    20       0.030 [0.004]   0.006 [0.002]   0.004 [0.001]   0.005 [0.002]   0.003 [0.001]     0.13 [0.11]
    50       0.07 [0.01]     0.028 [0.008]   0.012 [0.002]   0.011 [0.003]   0.014 [0.005]     0.15 [0.13]
    100      0.32 [0.41]     0.07 [0.01]     0.024 [0.005]   0.029 [0.009]   0.02 [0.01]       0.22 [0.18]
    500      0.40 [0.22]     0.44 [0.25]     0.33 [0.11]     0.24 [0.08]     0.38 [0.30]       0.51 [0.46]
    1000     0.97 [0.93]     0.97 [0.94]     0.92 [0.86]     0.90 [0.80]     0.87 [0.77]       0.78 [0.67]

In the first experiment, we only consider the vector corresponding to the first time-point. Then, we construct $X$, $Y$ and $Z$. We first pick $n$ observations with the corresponding labels to construct $X$ and $Y$ respectively. In order to build $Z$, we add $t = m - n$ other observations (whose labels are ignored) to $X$. The values of $n$ and $t$ are specified in Table 3 (for the transductive error) and Table 4 (for the denoising error), where the results for this setting are summarized.
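The construction of $X$, $Y$ and $Z$ described above can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the function name `split_labeled_unlabeled` and the small random matrix standing in for the $2587 \times 666$ motif-score design are assumptions made here.

```python
import numpy as np


def split_labeled_unlabeled(design, response, n, t, rng):
    """Draw n labeled rows (X, Y) and build Z from X plus t extra unlabeled rows."""
    L = design.shape[0]
    if n + t > L:
        raise ValueError("n + t must not exceed the number of available instances")
    idx = rng.permutation(L)[: n + t]  # n + t distinct instances, in random order
    X = design[idx[:n]]
    Y = response[idx[:n]]
    Z = design[idx]  # the first n rows of Z are the rows of X, so m = n + t
    return X, Y, Z


# Hypothetical stand-in for the real design matrix and one response vector.
rng = np.random.default_rng(0)
design = rng.standard_normal((100, 7))
response = rng.standard_normal(100)
X, Y, Z = split_labeled_unlabeled(design, response, n=10, t=20, rng=rng)
```

Repeating the call with fresh random draws gives the replications over which a median and a quantile of PERF can be computed.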

Most of these results confirm what has been observed in the simulation study. Indeed, we remark a difference in the performance of the methods between the high dimensional case and the case $p < n$ (we recall that $p = 666$). The difference between the last line, where $n = 1000$, and the other lines of both Tables 3 and 4 illustrates this point. Indeed, when $n$ is large, the improvement using the Transductive LASSO is not that significant for both the transductive and the denoising errors (about 0.90). We observe a big difference with the high dimensional case (the lines above), where the improvement using the Transductive LASSO is noticeable most of the time. In agreement with the simulation study, the performance of the Transductive LASSO is particularly marked for the denoising error. Indeed, PERF(X) is very low, with a median value between 0.001 and 0.50, as displayed in Table 4. Moreover, the performance of the Transductive LASSO compared to the LASSO gets better and better as $n$ gets smaller. According to the transductive error (Table 3), we also observe that the Transductive LASSO improves the LASSO estimator. Also in agreement with the simulation study, it turns out that the improvement using the Transductive LASSO is not that significant when $t$ (and then $m$) is large. Actually, the best case in this real dataset corresponds to the situation where $n$ is large ($n = 500$) and $t$ is small ($t = 10$), with a median value of PERF(Z) equal to 0.43. Another observation can be made. According to the results displayed in Table 3, we remark that the diagonal (with $n = t$) plays an important role. Indeed, the value of PERF(Z) when $n = t$ is around 0.8. Moreover, when $n > t$ the improvement is always better than 0.8 in these high dimensional experiments. This leads us to believe that the best situation for the Transductive LASSO here, but also in general, is when $n > t$.

Table 5: Mean (Mean), median (Med) and quantile Q3 of order 0.3 of the quantities PERF(Z) and PERF(X), when the methods are used in the high dimensional real dataset.

              PERF(Z)                 PERF(X)
    Gene      Mean   Med    Q3        Mean   Med    Q3
    1         0.91   0.94   0.88      0.44   0.39   0.25
    2         0.90   0.92   0.87      0.41   0.42   0.21
    Random    0.88   0.92   0.83      0.34   0.27   0.09

In all these results, we expect that the sparsity index $s$ plays a role. Indeed, we have already seen in the simulation experiments that the cases where $n > s$ and those where $n < s$ are different. Nevertheless, the above study does not enable us to draw a conclusion about an approximate value of $s$.

Let us now consider the second study. Here the way to construct $X$, $Y$ and $Z$ is the same as previously, except for the values of $n$ and $m$. Here both of them are random in $\{1, \ldots, L\}$ (recall that $L = 2587$ is the total number of available instances) and such that $L > m > 2n$. This is the least advantageous situation for the Transductive LASSO. These results can then be associated with the upper diagonal results of Tables 3 and 4. The main aspect of this study is that the time-point differs. Indeed, we choose the first time-point in the first experiment, the second in the second study, whereas we pick randomly one time-point for each replication in the third experiment (cf. Table 5).

The results are summarized in Table 5. This study reveals that the behavior of the Transductive LASSO compared to the LASSO remains the same for all the time-points. We observe that even in this real dataset, the Transductive LASSO is useful. Moreover, as expected in this case, the Transductive LASSO outperforms the LASSO estimator particularly in terms of the prediction error.






4 Theoretical results 

In this section, we consider the theoretical properties of the Transductive LASSO and Transductive Dantzig Selector, and more generally of the estimators $\hat\beta_{A,\lambda}$ and $\tilde\beta_{A,\lambda}$ given respectively by Definitions 2.1 and 2.2 for any given matrix $A$ ($A = \sqrt{n/m}\,Z$ is then a special case).

4.1 Assumptions 

Here, we give our two assumptions. The first one is about the matrix $A$, the second one is about the preliminary estimator $\widetilde{A\beta^*}$.

Assumption $H(A,\tau)$: there exists a constant $c(A,\tau) > 0$ such that, for any $\alpha \in \mathbb{R}^p$ such that $\sum_{j:\beta^*_j = 0} |\alpha_j| \le \tau \sum_{j:\beta^*_j \neq 0} |\alpha_j|$, we have

$$\alpha'(A'A)\alpha \ge c(A,\tau)\, n \sum_{j:\beta^*_j \neq 0} \alpha_j^2. \tag{4}$$

First, let us explain briefly the meaning of this hypothesis. In the case where $A$ has full rank, the condition

$$\alpha'(A'A)\alpha \ge c(A,\tau)\, n \sum_{j:\beta^*_j \neq 0} \alpha_j^2$$

is always satisfied for any $\alpha \in \mathbb{R}^p$ with $c(A,\tau)$ larger than the smallest eigenvalue of $A'A/n$. However, for the LASSO, we have $(A'A) = (X'X)$ and $A'A$ cannot be invertible if $p > n$. Even in this high dimensional setting, Assumption $H(A,\tau)$ may still be satisfied. Indeed, the assumption requires that Inequality (4) holds only for a small subset of $\mathbb{R}^p$ determined by the condition $\sum_{j:\beta^*_j = 0} |\alpha_j| \le \tau \sum_{j:\beta^*_j \neq 0} |\alpha_j|$. For $A = X$, this assumption becomes exactly the one taken in [BRT09]. In that paper, the necessity of such a hypothesis is also discussed.
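The remark about the full rank case can be checked numerically. The following sketch is an illustration only: the random Gaussian matrices, the support and the helper names `smallest_gram_eigenvalue` and `h_gap` are assumptions introduced here. It verifies that Inequality (4) holds for arbitrary vectors $\alpha$ when the constant is the smallest eigenvalue of $A'A/n$, and shows that this eigenvalue is numerically zero when $p > n$, which is why the restriction to the cone of admissible $\alpha$ matters in high dimension.

```python
import numpy as np

rng = np.random.default_rng(1)


def smallest_gram_eigenvalue(A):
    """Smallest eigenvalue of A'A / n, where n is the number of rows of A."""
    n = A.shape[0]
    return float(np.linalg.eigvalsh(A.T @ A / n)[0])


def h_gap(A, alpha, support, c):
    """Left side minus right side of Inequality (4) for a given constant c."""
    n = A.shape[0]
    return float(alpha @ (A.T @ A) @ alpha - c * n * np.sum(alpha[support] ** 2))


# Low dimensional case (p < n): A has full column rank.
A_low = rng.standard_normal((40, 5))
support = np.array([True, True, False, False, False])  # hypothetical support of beta*
c_low = smallest_gram_eigenvalue(A_low)
gaps = [h_gap(A_low, rng.standard_normal(5), support, c_low) for _ in range(1000)]

# High dimensional case (p > n): A'A is singular, so the same constant degenerates.
A_high = rng.standard_normal((10, 30))
c_high = smallest_gram_eigenvalue(A_high)
```

In the first case every entry of `gaps` is nonnegative up to rounding error, while in the second case `c_high` is zero up to rounding error, so a global eigenvalue bound gives no information.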

Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$: The estimator $\widetilde{A\beta^*}$ is such that, with probability at least $1 - \eta$,

$$\left\| A'\left(\widetilde{A\beta^*} - A\beta^*\right) \right\|_\infty \le \kappa\sigma\sqrt{2n\log\frac{p}{\eta}}.$$

This assumption will be discussed for different types of preliminary estimators in Section 4.3. However, note that it always holds when $A = X$ and $\widetilde{A\beta^*} = Y$ (that is, in the "usual" LASSO case). The idea of such an assumption results from the geometrical considerations in our previous work on confidence regions [Alq08, AH08]. It just means that the preliminary estimator $\widetilde{A\beta^*}$ may be used to build a suitable confidence region for $A\beta^*$.

4.2 Main results 

First, Theorem 4.1 below states that the estimator $\tilde\beta_{A,\lambda}$ satisfies a Sparsity Inequality (SI) with high probability. A particular consequence of this result is the fact that the Transductive Dantzig Selector $\tilde\beta_{\sqrt{n/m}Z,\lambda}$ satisfies a similar SI and responds to the transductive objective.

Theorem 4.1. Let us assume that Assumption $H(A,1)$ and Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$ are satisfied. Let us choose

$$\lambda = \kappa\sigma\sqrt{2n\log\frac{p}{\eta}},$$

for some $\eta \in ]0,1[$. Then, with probability at least $1 - \eta$, we have simultaneously

$$\left\| A\left(\tilde\beta_{A,\lambda} - \beta^*\right) \right\|_2^2 \le \frac{8\kappa^2\sigma^2\|\beta^*\|_0}{c(A,1)} \log\left(\frac{p}{\eta}\right)$$

and

$$\left\| \tilde\beta_{A,\lambda} - \beta^* \right\|_1 \le \frac{2\sqrt{2}\,\kappa\sigma\|\beta^*\|_0}{c(A,1)} \sqrt{\frac{\log(p/\eta)}{n}}.$$

We recall that all the proofs are postponed to Section 6. One can use this result to tackle the particular transductive task. This is the aim of Corollary 4.2.

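Corollary 4.2. Let $\lambda$ be defined as in Theorem 4.1. Under Assumption $H(\sqrt{n/m}\,Z, 1)$ and Assumption conf$(\widetilde{\sqrt{n/m}\,Z\beta^*}, \kappa, \eta)$, we have with probability $1 - \eta$

$$\frac{1}{m}\left\| Z\left(\tilde\beta_{\sqrt{n/m}Z,\lambda} - \beta^*\right) \right\|_2^2 \le \frac{8\kappa^2\sigma^2\|\beta^*\|_0}{c(\sqrt{n/m}\,Z, 1)} \, \frac{\log(p/\eta)}{n}.$$

Based on Theorem 4.1, a proper choice of the matrix $A$ also allows us to address the other objectives (denoising and estimation) considered in Section 2.2. Indeed, in those cases we obtain:

• Under Assumption $H(X,1)$ and Assumption conf$(\widetilde{X\beta^*}, \kappa, \eta)$ and with probability at least $1 - \eta$

$$\left\| X\left(\tilde\beta_{X,\lambda} - \beta^*\right) \right\|_2^2 \le \frac{8\kappa^2\sigma^2\|\beta^*\|_0}{c(X,1)} \log\left(\frac{p}{\eta}\right).$$

• Under Assumption conf$(\widetilde{\sqrt{n}\,I\beta^*}, \kappa, \eta)$ and with probability at least $1 - \eta$

$$\left\| \tilde\beta_{\sqrt{n}I,\lambda} - \beta^* \right\|_2^2 \le \frac{8\kappa^2\sigma^2\|\beta^*\|_0}{n} \log\left(\frac{p}{\eta}\right).$$

Corollary 4.2 and the above statements claim that each estimator performs well for the task it is designed to fulfill. In a similar way, we can finally establish analogous results for the Transductive LASSO and more generally for the estimator $\hat\beta_{A,\lambda}$ given by Definition 2.1.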



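Theorem 4.3. Let us assume that Assumption $H(A,3)$ and Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$ are satisfied. Let us choose

$$\lambda = 2\kappa\sigma\sqrt{2n\log\frac{p}{\eta}},$$

for some $\eta \in ]0,1[$. Then, with probability at least $1 - \eta$, we have simultaneously

$$\left\| A\left(\hat\beta_{A,\lambda} - \beta^*\right) \right\|_2^2 \le \frac{72\sigma^2\kappa^2\|\beta^*\|_0}{c(A,3)} \log\left(\frac{p}{\eta}\right)$$

and

$$\left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1 \le \frac{24\sqrt{2}\,\kappa\sigma\|\beta^*\|_0}{c(A,3)} \sqrt{\frac{\log(p/\eta)}{n}}.$$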
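The tuning parameters prescribed by Theorems 4.1 and 4.3 differ only by a factor 2 and are straightforward to evaluate once $\kappa$, $\sigma$, $n$, $p$ and $\eta$ are given. The helper functions below are a small sketch; their names are hypothetical and, in practice, $\sigma$ and $\kappa$ are usually unknown and would have to be estimated or calibrated, which this sketch does not address.

```python
import math


def conf_radius(kappa, sigma, n, p, eta):
    """kappa * sigma * sqrt(2 n log(p / eta)), the bound in Assumption conf."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in the open interval ]0, 1[")
    return kappa * sigma * math.sqrt(2.0 * n * math.log(p / eta))


def lambda_dantzig(kappa, sigma, n, p, eta):
    """Tuning parameter of Theorem 4.1 (Dantzig Selector type estimator)."""
    return conf_radius(kappa, sigma, n, p, eta)


def lambda_lasso(kappa, sigma, n, p, eta):
    """Tuning parameter of Theorem 4.3 (LASSO type estimator)."""
    return 2.0 * conf_radius(kappa, sigma, n, p, eta)


# Example with the dimensions of the real dataset (n = 100 labeled genes, p = 666).
lam_ds = lambda_dantzig(kappa=1.0, sigma=1.0, n=100, p=666, eta=0.05)
lam_l = lambda_lasso(kappa=1.0, sigma=1.0, n=100, p=666, eta=0.05)
```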
4.3 Examples of preliminary estimators

In this section, we examine some preliminary estimators $\widetilde{A\beta^*}$ and check whether they satisfy Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$. This is an important issue of the paper since it helps to understand how restrictive the assumptions in the results of Section 4.2 are. The first example deals with the (generalized) least squares estimator.

Theorem 4.4. Let us choose $\widetilde{(X'X)}^{-1}$ any pseudo-inverse of $(X'X)$ and let us set

$$\widetilde{A\beta^*} = A\,\widetilde{(X'X)}^{-1} X'Y,$$

as preliminary estimator. Then, under the assumption Ker(A) = Ker(X) and for any $\eta \in ]0,1[$,
Assumption coni {Af3*,K,r]) holds with k = / ^-.^ where $7 = {{A'A){X'X) {A'A))/n. 

According to Theorem 4.4, the standard case of interest is when $A = X$. The preliminary estimator becomes $\widetilde{X\beta^*} = Y$ and we obtain that conf$(Y, 1, \eta)$ holds. Plugging this into Theorems 4.1 and 4.3 implies the theorems about the LASSO and the Dantzig Selector provided in [BRT09]. Moreover, other choices for $A$ and $\widetilde{A\beta^*}$ are possible, which enable us to deal for instance with the transductive setting. Hence, one can interpret Theorem 4.4 together with Theorems 4.1 and 4.3 as a generalization of the result in [BRT09].
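The least squares type preliminary estimator of Theorem 4.4 can be sketched numerically. The code below is an illustration under explicit assumptions: it uses the Moore-Penrose pseudo-inverse (one admissible choice among the pseudo-inverses allowed by the theorem), an explicit cutoff `rcond` for numerically zero singular values, and a small random design; the function name `preliminary_estimator` is hypothetical. When $A = X$, $p > n$ and $X$ has rank $n$, the fitted vector coincides with $Y$, in line with the remark above.

```python
import numpy as np


def preliminary_estimator(A, X, Y, rcond=1e-10):
    """Return A (X'X)^+ X'Y, with (X'X)^+ the Moore-Penrose pseudo-inverse."""
    gram_pinv = np.linalg.pinv(X.T @ X, rcond=rcond)
    return A @ gram_pinv @ X.T @ Y


# Hypothetical small design with p > n; X is made of the first n rows of Z.
rng = np.random.default_rng(2)
n, m, p = 5, 12, 8
Z = rng.standard_normal((m, p))
X = Z[:n]
Y = rng.standard_normal(n)

# Usual case A = X: the preliminary estimator reproduces Y.
fitted_labeled = preliminary_estimator(X, X, Y)

# Transductive case A = sqrt(n / m) Z: one fitted value per row of Z.
A = np.sqrt(n / m) * Z
fitted_transductive = preliminary_estimator(A, X, Y)
```

Note that this sketch does not check the condition Ker(A) = Ker(X) required by the theorem; it only shows how the estimator is formed.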

To introduce the second preliminary estimator, let us consider the case when $A \neq X$. Then the assumption Ker(A) = Ker(X) is restrictive when $p > n$ (in the more favorable case $p < n$, the assumption holds since both $X$ and $Z$ may have full rank). If the relation Ker(A) = Ker(X) is not satisfied, as the construction of $Z$ leads to Ker(Z) $\subset$ Ker(X), we may suggest the following alternative. Consider the restriction of the estimation procedure

to the span of X. That is, let replace Z by Zx = {X'X) {X'X)Z. Then the assumption 
Ker($Z_X$) = Ker(X) is satisfied. As a consequence, with probability at least $1 - \eta$, the following
inequality



2 

is obtained for instance for the Transductive Dantzig Selector (an analogous inequality can be written for the Transductive LASSO), under Assumption $H(\sqrt{n/m}\,Z_X, 1)$ and with the same choice of the tuning parameter $\lambda$ as in Theorem 4.1. Finally, let us remark that
{Z — Zx)0 i—i—r7 > = and conclude the following result. 



Corollary 4.5. Under Assumption $H(\sqrt{n/m}\,Z_X, 1)$ and with the same choice of $\lambda$ as in Theorem 4.1, we have with probability at least $1 - \eta$,

2 c{^/n/mZx,l) \VJ 






The conclusion of this result is quite intuitive: when $\|(Z - Z_X)\beta^*\|_2$ is large, the information in $X$ is not sufficient to estimate $Z\beta^*$. But, if $\|(Z - Z_X)\beta^*\|_2$ is small, the Transductive Dantzig Selector based on $Z_X$ has good performance. This assumption has the same status as a regularity assumption in a non-parametric setting. Obviously, we cannot know whether $\|(Z - Z_X)\beta^*\|_2$ is small or not. However, when it is not, it seems impossible to guarantee a good estimation.

The final preliminary estimator we examine here has also been studied in the experiments part (cf. Section 3). Let us consider the Dantzig Selector as preliminary estimator. Here, a quite natural assumption can be made. It essentially says that $X'X$ and $A'A$ are not too far from each other.

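Theorem 4.6. Let us assume that there is a constant $k > 0$ such that for any $u \in \mathbb{R}^p$ with $\|u\|_1 \le 2\|\beta^*\|_1$,

$$\left\| \left[(X'X) - (A'A)\right] u \right\|_\infty \le k\sigma\sqrt{2n\log(p)}.$$

Let moreover $\eta \in ]0,1[$ and set the preliminary estimator

$$\widetilde{A\beta^*} = A\,\hat\beta^{DS}_{2\sigma\sqrt{2n\log(p/\eta)}}.$$

Then Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$ is true with $\kappa = 4 + k$.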



The same result would hold as well for the LASSO as a preliminary estimator. Moreover, in this last result, one can also consider the transductive objective and take the matrix $A = \sqrt{n/m}\,Z_X$ as introduced above. Such a choice helps us to provide good theoretical guarantees with very mild assumptions on the Gram matrix $X'X$.

5 Conclusion 

In this paper, we studied transductive versions of the LASSO and the Dantzig Selector. These new methods appear to enjoy both theoretical and practical advantages. Indeed, on the one hand, we showed that the Transductive LASSO and Dantzig Selector satisfy sparsity inequalities under weaker assumptions on the Gram matrix than the original methods. On the other hand, we displayed some experimental results illustrating the superiority of the Transductive LASSO over the LASSO. On top of that, these transductive methods are easy to compute. The experimental study reveals that the Transductive LASSO is often much better than the original LASSO. Nevertheless, when the number of unlabeled observations is much larger than the sample size, it turns out that the gain from using the Transductive LASSO is reduced. We will focus on this point in future work.

6 Proofs 

In this section, we give the proofs of our main results. 



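6.1 Proofs of Theorems 4.1 and 4.3

Proof of Theorem 4.1. First, we have obviously

$$\left\| A(\tilde\beta_{A,\lambda} - \beta^*) \right\|_2^2 = (\tilde\beta_{A,\lambda} - \beta^*)' A'A (\tilde\beta_{A,\lambda} - \beta^*) \le \left\| \tilde\beta_{A,\lambda} - \beta^* \right\|_1 \left\| A'A(\tilde\beta_{A,\lambda} - \beta^*) \right\|_\infty$$
$$\le \left\| \tilde\beta_{A,\lambda} - \beta^* \right\|_1 \left[ \left\| A'\left(A\tilde\beta_{A,\lambda} - \widetilde{A\beta^*}\right) \right\|_\infty + \left\| A'\left(\widetilde{A\beta^*} - A\beta^*\right) \right\|_\infty \right]. \tag{5}$$

Then, just remark that by Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$, we have, with probability at least $1 - \eta$,

$$\left\| A'\left(A\beta^* - \widetilde{A\beta^*}\right) \right\|_\infty \le \kappa\sigma\sqrt{2n\log\frac{p}{\eta}}. \tag{6}$$

Moreover, by the definition of $\tilde\beta_{A,\lambda}$ (Definition 2.2) we have

$$\left\| A'\left(A\tilde\beta_{A,\lambda} - \widetilde{A\beta^*}\right) \right\|_\infty \le \lambda.$$

Then, combining the fact that $\tilde\beta_{A,\lambda}$ minimizes $\|\cdot\|_1$ among all the vectors $\beta$ satisfying

$$\left\| A'\left(A\beta - \widetilde{A\beta^*}\right) \right\|_\infty \le \lambda,$$

and the fact that, thanks to (6) and as soon as $\lambda = \kappa\sigma\sqrt{2n\log(p/\eta)}$, the vector $\beta^*$ satisfies the same inequality, we have

$$0 \le \|\beta^*\|_1 - \|\tilde\beta_{A,\lambda}\|_1 = \sum_{\beta^*_j \neq 0} |\beta^*_j| - \sum_{\beta^*_j \neq 0} |(\tilde\beta_{A,\lambda})_j| - \sum_{\beta^*_j = 0} |(\tilde\beta_{A,\lambda})_j|$$
$$\le \sum_{\beta^*_j \neq 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right| - \sum_{\beta^*_j = 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right|.$$

As a consequence, we have

$$\sum_{\beta^*_j = 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right| \le \sum_{\beta^*_j \neq 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right|,$$

which implies that the vector $\beta^* - \tilde\beta_{A,\lambda}$ is an admissible $\alpha$ for the relation in Assumption $H(A,1)$. Hence, using this assumption in the last above inequality, we have the following upper bound

$$\left\| \tilde\beta_{A,\lambda} - \beta^* \right\|_1 = \sum_{\beta^*_j = 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right| + \sum_{\beta^*_j \neq 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right| \le 2 \sum_{\beta^*_j \neq 0} \left|\beta^*_j - (\tilde\beta_{A,\lambda})_j\right|$$
$$\le \sqrt{\mathrm{card}\{j : \beta^*_j \neq 0\} \sum_{\beta^*_j \neq 0} \left(\beta^*_j - (\tilde\beta_{A,\lambda})_j\right)^2} \le \sqrt{\frac{\|\beta^*\|_0}{n\,c(A,1)}} \left\| A(\tilde\beta_{A,\lambda} - \beta^*) \right\|_2. \tag{7}$$

We plug this result into Inequality (5) to obtain, with probability at least $1 - \eta$,

$$\left\| A(\tilde\beta_{A,\lambda} - \beta^*) \right\|_2^2 \le 2\kappa\sigma\sqrt{2\log\left(\frac{p}{\eta}\right) \frac{\|\beta^*\|_0}{c(A,1)}} \left\| A(\tilde\beta_{A,\lambda} - \beta^*) \right\|_2,$$

that leads to

$$\left\| A(\tilde\beta_{A,\lambda} - \beta^*) \right\|_2^2 \le \frac{8\kappa^2\sigma^2\|\beta^*\|_0}{c(A,1)} \log\left(\frac{p}{\eta}\right).$$

Plugging this last inequality into Inequality (7) gives

$$\left\| \tilde\beta_{A,\lambda} - \beta^* \right\|_1 \le \frac{2\sqrt{2}\,\kappa\sigma\|\beta^*\|_0}{c(A,1)} \sqrt{\frac{\log(p/\eta)}{n}},$$

and this ends the proof.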
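Proof of Theorem 4.3. By the definition of the Transductive LASSO (Definition 2.1) we have

$$-2\,\widetilde{A\beta^*}' A\hat\beta_{A,\lambda} + \hat\beta_{A,\lambda}' A'A \hat\beta_{A,\lambda} + 2\lambda\|\hat\beta_{A,\lambda}\|_1 \le -2\,\widetilde{A\beta^*}' A\beta^* + (\beta^*)' A'A \beta^* + 2\lambda\|\beta^*\|_1.$$

We can rewrite that as

$$-2(\beta^*)' A'A \hat\beta_{A,\lambda} + 2\left[A\beta^* - \widetilde{A\beta^*}\right]' A\hat\beta_{A,\lambda} + \hat\beta_{A,\lambda}' A'A \hat\beta_{A,\lambda} + 2\lambda\|\hat\beta_{A,\lambda}\|_1$$
$$\le -(\beta^*)' A'A \beta^* + 2\left[A\beta^* - \widetilde{A\beta^*}\right]' A\beta^* + 2\lambda\|\beta^*\|_1,$$

or, rearranging the terms,

$$\left\| A(\hat\beta_{A,\lambda} - \beta^*) \right\|_2^2 = (\hat\beta_{A,\lambda} - \beta^*)' A'A (\hat\beta_{A,\lambda} - \beta^*) \le 2\left[A\beta^* - \widetilde{A\beta^*}\right]' A\left(\beta^* - \hat\beta_{A,\lambda}\right) + 2\lambda\left(\|\beta^*\|_1 - \|\hat\beta_{A,\lambda}\|_1\right). \tag{8}$$

Now, let us remark that

$$\left[A\beta^* - \widetilde{A\beta^*}\right]' A\left(\beta^* - \hat\beta_{A,\lambda}\right) = \left[A'\left(A\beta^* - \widetilde{A\beta^*}\right)\right]' \left(\beta^* - \hat\beta_{A,\lambda}\right)$$
$$\le \left\| A'\left(A\beta^* - \widetilde{A\beta^*}\right) \right\|_\infty \left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1 \le \frac{\lambda}{2} \left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1$$

with probability $1 - \eta$, provided that $\lambda = 2\kappa\sigma\sqrt{2n\log(p/\eta)}$, together with Assumption conf$(\widetilde{A\beta^*}, \kappa, \eta)$. We plug that into Inequality (8) to obtain, with probability $1 - \eta$,

$$\left\| A(\hat\beta_{A,\lambda} - \beta^*) \right\|_2^2 \le \lambda \left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1 + 2\lambda\left(\|\beta^*\|_1 - \|\hat\beta_{A,\lambda}\|_1\right).$$

This leads to

$$\left\| A(\hat\beta_{A,\lambda} - \beta^*) \right\|_2^2 + \lambda \left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1 \le 2\lambda\left( \left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1 + \|\beta^*\|_1 - \|\hat\beta_{A,\lambda}\|_1 \right)$$
$$= 2\lambda \sum_{j=1}^{p} \left( \left|\beta^*_j - (\hat\beta_{A,\lambda})_j\right| + |\beta^*_j| - \left|(\hat\beta_{A,\lambda})_j\right| \right) = 2\lambda \sum_{\beta^*_j \neq 0} \left( \left|\beta^*_j - (\hat\beta_{A,\lambda})_j\right| + |\beta^*_j| - \left|(\hat\beta_{A,\lambda})_j\right| \right)$$
$$\le 4\lambda \sum_{\beta^*_j \neq 0} \left|\beta^*_j - (\hat\beta_{A,\lambda})_j\right|, \tag{9}$$

and, from this Inequality (9), we deduce that $\hat\beta_{A,\lambda} - \beta^*$ is an admissible vector $\alpha$ in Assumption $H(A,3)$. Then we obtain, still from (9) and Assumption $H(A,3)$,

$$\left\| A(\hat\beta_{A,\lambda} - \beta^*) \right\|_2^2 \le 3\lambda \sum_{\beta^*_j \neq 0} \left|\beta^*_j - (\hat\beta_{A,\lambda})_j\right| \le 6\kappa\sigma \sqrt{2n\log\left(\frac{p}{\eta}\right) \|\beta^*\|_0 \sum_{\beta^*_j \neq 0} \left(\beta^*_j - (\hat\beta_{A,\lambda})_j\right)^2}$$
$$\le 6\kappa\sigma \sqrt{\frac{2\log(p/\eta)\,\|\beta^*\|_0}{c(A,3)}} \left\| A(\hat\beta_{A,\lambda} - \beta^*) \right\|_2.$$

This last display implies

$$\left\| A(\hat\beta_{A,\lambda} - \beta^*) \right\|_2^2 \le \frac{72\sigma^2\kappa^2\|\beta^*\|_0 \log(p/\eta)}{c(A,3)}.$$

We plug this last result into Inequality (9) to obtain

$$\left\| \beta^* - \hat\beta_{A,\lambda} \right\|_1 \le \frac{24\sqrt{2}\,\kappa\sigma\|\beta^*\|_0}{c(A,3)} \sqrt{\frac{\log(p/\eta)}{n}}.$$

□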



6.2 Proofs of Theorems 4.4 and 4.6

Proof of Theorem 4.4. The proof is quite simple. As $Y \sim \mathcal{N}(X\beta^*, \sigma^2 I_n)$, we have



and so 



(JCX) X'Y - /3* ~ AA ( 0, a^ (X^) 



Aa[{X'X) X'F-/3* ~AA(0,a2jl), 



where i7 denotes the matrix il = $7(^,X) = ^'A(X'X) A' A. Let us also define 

v = a'a{{x^) X'Y-P*]. 



Then, for any j G {1, . . . ,p}, 

Using a standard inequality on the tail of Gaussian variables yields 

2^ 



, ^ , KaV2nlog(p/7?) I \ / f^-^nlo^ivlv) 

Vj\ > Kay/2nlog{p/r])) < exp | -^ ^^-^ ^ | = exp ' ^^^"^^^/^/^ 



2a2^., 







J J 



Then, using a union bound and the concavity of the function x i— )• exp(— x), we easily obtain 



Ml^lloo > Kay^2nlog{p/r])j < ^exp 



i=i 



K'^nlog{p/r]) 



^ ^j,j 



P ,.2 



1 Y^ K n\og{p/r]) 



< pexp I — ^ 



fi. 



The above quantity F ( ||V^||oo > KCTy^2nlog{p/r]) 1 is smaller than r], if the parameter k is such 
that 



pexp 



~E 



P .,.2 



1 Y^ K n\og{p/r]) 



i=i 



[7. 



or equivalently k 



- ^„ ^ i-. This is the announced result. 
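Proof of Theorem 4.6. We have, for any $\mu > 0$,

$$\left\| A'A\left(\hat\beta^{DS}_{\mu} - \beta^*\right) \right\|_\infty \le \left\| (A'A - X'X)\left(\hat\beta^{DS}_{\mu} - \beta^*\right) \right\|_\infty + \left\| X'X\left(\hat\beta^{DS}_{\mu} - \beta^*\right) \right\|_\infty.$$

Now, for the Dantzig Selector,

$$\left\| X'X\left(\hat\beta^{DS}_{\mu} - \beta^*\right) \right\|_\infty \le 2\mu,$$

with probability at least $1 - \eta$, provided that $\mu = 2\sigma\sqrt{2n\log(p/\eta)}$. Moreover,

$$\left\| \hat\beta^{DS}_{\mu} - \beta^* \right\|_1 \le \left\| \hat\beta^{DS}_{\mu} \right\|_1 + \|\beta^*\|_1 \le 2\|\beta^*\|_1$$

implies that

$$\left\| (A'A - X'X)\left(\hat\beta^{DS}_{\mu} - \beta^*\right) \right\|_\infty \le k\sigma\sqrt{2n\log(p)}.$$

As a conclusion, with probability $1 - \eta$,

$$\left\| A'A\left(\hat\beta^{DS}_{\mu} - \beta^*\right) \right\|_\infty \le 4\sigma\sqrt{2n\log(p/\eta)} + k\sigma\sqrt{2n\log(p)} \le (4 + k)\sigma\sqrt{2n\log(p/\eta)}.$$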